ENH: add dropna argument to pivot_table #4106

hayd · 2013-07-02T18:16:57Z

import numpy as np
import pandas as pd

a = np.array(['foo', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo'], dtype=object)
b = np.array(['one', 'one', 'two', 'one', 'two', 'two', 'two'], dtype=object)
c = np.array(['dull', 'dull', 'dull', 'dull', 'dull', 'shiny', 'shiny'], dtype=object)

In [11]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'], drop_na=False)
Out[11]:
b     one          two
c    dull  shiny  dull  shiny
a
bar     1      0     1      0
foo     2      0     1      2

Also same argument for pivot_table.
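For reference, here is a minimal sketch of the same call using the final `dropna` spelling settled on later in this thread, plus a `pivot_table` equivalent. The helper column `n` and the variable names `ct` and `pt` are made up for illustration and are not part of the PR.

```python
import numpy as np
import pandas as pd

a = np.array(['foo', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo'], dtype=object)
b = np.array(['one', 'one', 'two', 'one', 'two', 'two', 'two'], dtype=object)
c = np.array(['dull', 'dull', 'dull', 'dull', 'dull', 'shiny', 'shiny'], dtype=object)

# dropna=False keeps column combinations that never occur in the data,
# such as (one, shiny); crosstab fills those cells with 0.
ct = pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'], dropna=False)

# The same table via pivot_table. The constant column 'n' is a hypothetical
# helper so that summing it counts the rows in each cell.
df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'n': 1})
pt = df.pivot_table(values='n', index='a', columns=['b', 'c'],
                    aggfunc='sum', fill_value=0, dropna=False)

print(ct)
print(pt)
```

With the default `dropna=True`, columns whose entries are all missing are left out of the result.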

jtratner · 2013-07-02T22:55:44Z

Question: Why did you choose drop_na instead of dropna which is the name of the data frame method?

hayd · 2013-07-02T23:05:30Z

@jtratner whoops!

hayd · 2013-07-03T00:08:10Z

I've also realised that doing something like np.testing.assert_equal(m[:3], m) doesn't raise (I guess I need to use np.testing.assert_equal(m[:3].values, m.values))...

jtratner · 2013-07-03T00:10:09Z

@hayd assert_frame_equal? or assert_almost_equal?

jtratner · 2013-07-03T00:11:56Z

ah didn't realize it was a multiindex. (internal representation is an empty ndarray)
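A rough sketch of two stricter comparisons for the case above. `pd.testing.assert_index_equal` is not mentioned in this thread; it is assumed here because it is the public index-comparison helper in current pandas. The index `m` is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical MultiIndex standing in for the `m` discussed above.
m = pd.MultiIndex.from_tuples(
    [('foo', 'one'), ('foo', 'two'), ('bar', 'one'), ('bar', 'two')],
    names=['a', 'b'],
)

# Option 1: compare the materialised arrays of tuples, as suggested above.
try:
    np.testing.assert_array_equal(m[:3].values, m.values)
except AssertionError:
    values_mismatch_detected = True
else:
    values_mismatch_detected = False

# Option 2: use pandas' own index comparison helper.
try:
    pd.testing.assert_index_equal(m[:3], m)
except AssertionError:
    index_mismatch_detected = True
else:
    index_mismatch_detected = False

print(values_mismatch_detected, index_mismatch_detected)
```

Both checks compare the tuples themselves, so they do not depend on how the MultiIndex is stored internally.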

cpcloud · 2013-07-03T01:02:21Z

imo len(some_multiindex) should behave as if some_multiindex is a 2d array.

cpcloud · 2013-07-03T01:02:40Z

any reason why that shouldn't be the case?

jtratner · 2013-07-03T01:04:34Z

__len__ needs to match __iter__ (generally) - I need to look at it

jtratner · 2013-07-03T01:06:38Z

I think it needs to be:

def __len__(self):
    return len(self.levels) * len(self.labels)

jtratner · 2013-07-03T01:07:23Z

That way len(list(iter(multi_index))) == len(multi_index)

cpcloud · 2013-07-03T01:07:59Z

that doesn't match __iter__ since __iter__ is over each tuple. should be

def __len__(self):
    return len(self.values)

cpcloud · 2013-07-03T01:09:39Z

e.g., if there are 10 levels and 10 labels then len will return 100, which makes sense as maybe a size attr like an ndarray's, but really MultiIndex straddles the array-of-tuples interpretation and the 2d-array interpretation

jtratner · 2013-07-03T01:22:34Z

Oh, I thought len(self.labels) * len(self.levels) was equal to len(self.values)?

jtratner · 2013-07-03T01:24:20Z

Nope, I'm wrong :P
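To make the conclusion concrete, here is a small sketch on a made-up index. It assumes a current pandas, where the attribute this thread calls `labels` is named `codes`.

```python
import pandas as pd

# Hypothetical 3 x 2 index: six tuples built from two levels.
mi = pd.MultiIndex.from_product([list('ABC'), [1, 2]],
                                names=['letter', 'number'])

n_tuples = len(list(iter(mi)))   # __iter__ yields one tuple per row
n_values = len(mi.values)        # .values is a 1-D array of those tuples

# levels and codes each hold one entry per level (two here), so their
# product says nothing about the number of rows.
n_levels_times_codes = len(mi.levels) * len(mi.codes)

print(len(mi), n_tuples, n_values, n_levels_times_codes)
```

Here `len(mi)` agrees with both the iteration count and `len(mi.values)`, while `len(levels) * len(codes)` gives a different number.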

hayd · 2013-07-03T07:23:22Z

@cpcloud no reason really... I was away from the internet when I wrote the test case and forgot what it was.

hayd · 2013-07-06T21:06:08Z

Converted test case to example from previous issue.

cartesian_product is slightly slower (when passing to MultiIndex) for small inputs but considerably faster for large ones:

In [23]: %timeit pd.MultiIndex.from_tuples(list(product(list('ABC'), [1, 2])))
1000 loops, best of 3: 399 us per loop

In [24]: %timeit pd.MultiIndex.from_arrays(cartesian_product([list('ABC'), [1, 2]]))
1000 loops, best of 3: 541 us per loop

X = list('ABC' * 100)
Y = [1,2] * 100

In [27]: %timeit pd.MultiIndex.from_arrays(cartesian_product([X, Y]))
100 loops, best of 3: 8.1 ms per loop

In [28]: %timeit pd.MultiIndex.from_tuples(list(product(X, Y)))
10 loops, best of 3: 21.5 ms per loop
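For readers unfamiliar with the array-based approach, here is a minimal NumPy sketch of a Cartesian product that returns one flat array per input. It is not the `cartesian_product` from this PR; the function name `cartesian_product_sketch` and the repeat/tile strategy are assumptions for illustration only.

```python
from itertools import product
from math import prod

import numpy as np


def cartesian_product_sketch(arrays):
    """Return one 1-D array per input; zipped together they enumerate
    the Cartesian product in the same order as itertools.product."""
    arrays = [np.asarray(arr) for arr in arrays]
    lengths = [len(arr) for arr in arrays]
    out = []
    for i, arr in enumerate(arrays):
        inner = prod(lengths[i + 1:])  # how often each element repeats in a row
        outer = prod(lengths[:i])      # how often the repeated block is tiled
        out.append(np.tile(np.repeat(arr, inner), outer))
    return out


letters, numbers = cartesian_product_sketch([list('ABC'), [1, 2]])
pairs = list(zip(letters.tolist(), numbers.tolist()))
expected = list(product(list('ABC'), [1, 2]))
print(pairs == expected)
```

Arrays in this per-level form can be passed to `pd.MultiIndex.from_arrays`, which avoids building a Python tuple for every row first.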

jreback · 2013-07-10T13:10:17Z

merge?

hayd added 2 commits July 6, 2013 20:59

ENH add drop_na argument to pivot_table

2d63a71

CLN make cartesian product faster

fefa2bf

hayd added a commit that referenced this pull request Jul 10, 2013

Merge pull request #4106 from hayd/cartesian_crosstab

89b2f83

hayd merged commit 89b2f83 into pandas-dev:master Jul 10, 2013
